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Abstract 

Reduplication, a central instance of prosodic mor- 
phology, is particularly challenging for state-of- 
the-art computational morphology, since it involves 
copying of some part of a phonological string. In 
this paper I advocate a finite-state method that com- 
bines enriched lexical representations via intersec- 
tion to implement the copying. The proposal in- 
cludes a resource-conscious variant of automata and 
can benefit from the existence of lazy algorithms. 
Finally, the implementation of a complex case from 
Koasati is presented. 

1 Introduction 

In the past two decades computational morphology 
has been quite successful in dealing with the chal- 
lenges posed by natural language word patterns. 
Using finite-state methods, it has been possible to 
describe both word formation and the concomi- 
tant phonological modifications in many languages, 
ranging from straightforward concatenative combi- 



nation (Koskenniemi, 1983) over Semitic-style non 



concatenative intercalation (Beesley (1996), Kiraz 
( 1994 )) to circumfixional long-distance dependen- 
cies ( [Beesley, 1998Q . 

However, Sproat ( 1992 ) observes that, despite 
the existence of "working systems that are capa- 
ble of doing a great deal of morphological analy- 
sis", "there are still outstanding problems and ar- 
eas which have not received much serious attention" 
(ibid., 123). Problem areas in his view include sub- 
tractive morphology, infixation, the proper inclu- 
sion of prosodic structure and, in particular, redu- 
plication: "From a computational point of view, one 
point cannot be overstressed: the copying required 
in reduplication places reduplication in a class apart 
from all other morphology." (ibid., 60). Productive 
reduplication is so troublesome for a formal account 
based on regular languages (or regular relations) 
because unbounded total instances like Indonesian 



noun plural (orang-orang 'men') are isomorphic to 
the copy language ww, which is context-sensitive. 

In the rest of this paper I will lay out a proposal 
for handling reduplication with finite-state methods. 



As a starting point, I adopt Bird & Ellison (Jl994j)'s 
One-Level Phonology, a monostratal constraint- 
based framework where phonological representa- 
tions, morphemes and generalizations are all finite- 
state automata (FSAs) and constraint combination 
is accomplished via automata intersection. While 
it is possible to transfer much of the present pro- 
posal to the transducer-based setting that is often 
preferred nowadays, the monostratal approach still 
offers an attractive alternative due to its easy blend 
with monostratal grammars such as HPSG and the 
good prospects for machine learning of its surface- 
true constraints (Ellison ( gggg ), Belz ( g99g )). 

After a brief survey of important kinds of redupli- 
cation in §|2|, section §|| explains the necessary ex- 
tensions of One-Level Phonology to deal with the 
challenges presented by reduplication, within the 
larger domain of prosodic morphology in general. 
A worked-out example from Koasati in §^| illus- 
trates the interplay of the various components in an 
implemented analysis, before some conclusions are 
drawn in section §|5[ 

2 Reduplication 

A well-known case from the context-sensitivity 
debate of the eighties is the N-o-N reduplicative 
construction from Bambara (Northwestern Mande, 



flCuly, 1985D ): 



(1) a. wulu-o-wulu 'whichever dog' 

b. wulunyinina-o-wulunyinina 
'whichever dog searcher' 

c. wulunyininafilela-o-wulunyininafilela 

'whoever watches dog searchers' 

Beyond total copying, [~Fj also illustrates the pos- 
sibility of so-called fixed-melody parts in redupli- 



cation: a constant lol intervenes between base (i.e. 
original) and reduplicant (i.e. copied part, in bold 
prrnt).Q 

The next case from Semai expressive minor redu- 
plication (Mon- Khmer, Hendricks ( 1998 )) high- 
lights the possibility of an interaction between redu- 
plication and internal truncation: 

(2) a. c?e:t ct-c?e:t 'sweet' 

b. drjoh dh-drph 'appearance of nod- 

ding constantly' 

c. cfa:l cl-cfa:l 'appearance of flick- 

ering red object' 

Reduplication copies the initial and final segment of 
the base, skipping all of its interior segments, which 
may be of arbitrary length. 

A final case comes from Koasati punctual-aspect 
reduplication (Muscogean, (Kimball, 1988)): 



(3) a. 



c. 



ta.has.pin txahas-txoi-pin 

'to be light in weight' 
la.pat.kin liapat-li6:-kin 

'to be narrow 
ak.lat.lin aik-hio-latlin 

'to be loose' 
okcakkon oik-hio-cakkon 
'to be green or blue' 



Koasati is particularly interesting, because it shows 
that copy and original need not always be adjacent 
- here the reduplicant is infixed into its own base - 
and also because it illustrates that the copy may be 
phonologically modified: the /h/ in the copied part 
of [3)[c,d is best analysed as a voiceless vowel, i.e. 
the phonetically closest consonantal expression of 
its source. Moreover, the locus of the infixed redu- 
plicant is predictable on prosodic grounds, as it is 
inserted after the first heavy syllable of the base. 
Heavy syllables in Koasati are long (C)VV or closed 
(C)VC. Prosodic influence is also responsible for 
the length alternation of its fixed-melody part /o(o)/, 
since the heaviness requirement for the penultimate, 
stressed, syllable of the word causes long [o:] iff the 
reduplicant constitutes that syllable. 



1 Culy (1985 ), who presents a superset of the data under ^ 
in the context of a formal proof of context-sensitivity, shows 
that the reduplicative construction in fact can copy the outcome 
of a recursive agentive construction, thereby becoming truly 
unbounded. He emphasizes the fact that it is "very productive, 
with few, if any restrictions on the choice of the noun" (p. 346). 



3 Finite-State Methods 

The present proposal differs from the state-labelled 
automata employed in One-Level Phonology by re- 
turning to conventional arc-labelled ones, but shares 
the idea that labels denote sets, which is advanta- 
geous for compact automata. 

3.1 Enriched Representations 

As motivated in §|2], an appropriate automaton repre- 
sentation of morphemes that may undergo redupli- 
cation should provide generic support for three key 
operations: (i) copying or repetition of symbols, (ii) 
truncation or skipping, and (iii) infixation. 

For copying, the idea is to enrich the FSA rep- 
resenting a morpheme by encoding stepwise repeti- 
tion locally. For every content arc i -A j we add a 

reverse repeat arc j — ► i. Following repeat arcs, 
we can now move backwards within a string, as we 
shall see in more detail below. 

For truncation, a similar local encoding is avail- 
able: For every content arc i A j, add another skip 

arc i — > j. This allows us to move forward while 
suppressing the spellout of c. 

A generic recipe for infixation ensures that seg- 
mental material can be inserted anywhere within 
an existing morpheme FSA. A possible representa- 

tional enrichment therefore adds a self loop % — > i 
labelled with the symbol alphabet S to every state i 
of the FSA.0 

Each of the three enrichments presupposes an 
epsilon-free automaton in order to be wellbehaved. 
This requirement in particular ensures that techni- 
cal arcs (skip, repeat) are in 1:1 correspondence 
with content arcs, which is essential for unambigu- 
ous positional movement: e.g. addskips(aeb) 
would ambiguously require 1 or 2 skips to supress 
the spellout of b, because it creates a disjunction of 
the empty string e with skip. It is perhaps worth 
emphasizing that there is no special interpretation 
whatsoever for these technical arcs: the standard au- 
tomaton semantics is unaffected. As a consequence, 
skip and repeat will be a visible part of the output 
in word form generation and must be allowed in the 
input for parsing as well. 

Taken together, the three enrichments yield an 
automaton for Bambara wulu, shown in figure [[[a. 
While skipping is not necessary for this example, 

4 — > 4 is: it will host the fixed-melody lol. The 

2 This can be seen as a n application of the ignore operator 
of Kaplan and Kay (1994), where E* is being ignored. 



repeat arcs will of course facilitate copying, as we 
shall see in a moment. 



a. 



b. 




Figure 1: Enriched automata for wulu (a.), Bambara 
N-o-N reduplication (b.) 

3.2 Copying as Intersection 



Bird & Ellison (1992) came close to discovering 



a useful device for reduplication when they noted 
that automaton intersection has at least indexed- 
grammar power (ibid., p.48). They demonstrated 
their claim by showing that odd-length strings of 
indefinite length like the one described by the 
regular expression (abode f g) + can be repeated 
by intersecting them with an automaton accept- 
ing only strings of even length: the result is 
(abode f g abode f g) + . 

Generalizing from their artifical example, let us 
first make one additional minor enrichment by tag- 
ging the edges of the reduplicative portion of a 
base with synchronization bits :1, while using 
the opposite value :0 for the interior part (see fig- 
ure |].a). This gives us a segment-independent 
handle on those edges and a regular expression 
seg.iseg.Q* seg.i for the whole synchronized portion 
(seg abbreviates the set of phonological segments). 

Assuming repeat-enriched bases, a total redupli- 
cation morpheme can now be seen as a partial word 
specification which mentions two synchronized por- 
tions separated by an arbitrary-length move back- 
wards: 

(4) seg-.iseg-.o* seg : ± repeat* seg-.iseg-.o* seg-.i 

Moreover, total reduplicative copying now simply 
is intersection of the base and (4), or - in the Bam- 
bara case - a simple variant that adds the lol (figure 
|j].b). Disregarding self loops for the moment, the 
reader may verify that no expansion of the kleene- 
starred repeat that traverses less than |6ase| seg- 
ments will satisfy the demand for two synchronized 
portions. Semai requires another slight variant of 
|(4)| which skips the interior of the base in the redu- 
plicant: 



(5) seg : i skip*seg : i repeat* seg-.iseg-.o* seg : i 

The identification of copying with intersection not 
only allows for great flexibility in describing the full 
range of actual reduplicative constructions with reg- 
ular expressions, it also reuses the central operation 
for constraint combination that is independently re- 
quired for one-level morphology and phonology. 
Any improvement in efficient implementation of 
intersection therefore has immediate benefits for 
grammar computation as a whole. In contrast, a 
hypothetical setup where a dedicated total copy de- 
vice is sandwiched between finite-state transducers 
seems much less elegant and may require additional 
machinery to detect copies during parsing. 

Note that it is in fact possible to compute 
reduplication-as-intersection over an entire lexicon 
of bases (see figure || for an example), provided that 
repeat arcs are added individually to each base. En- 
riched base FSAs can then be unioned together and 
undergo further automaton transformations such as 
determinization or minimization. This restriction 
is necessary because our finite-state method cannot 
express token identity as normally required in string 
repetition. Rather than identifying the same token, it 
addresses the same string position, using the weaker 
notion of type identity. Therefore, application of the 
method is only safe if strings are effectively isolated 
from one another, which is exactly what per-base 
enrichment achieves. See §^4 for a suggestion on 
how to lift the restriction in practice. 

3.3 Resource Consciousness 

One pays a certain price for allowing general repe- 
tition and infixation: because of its self loops and 
technical arcs, the automaton of figure [l[a over- 
generates wildly. Also, during intersection, self 
loops can absorb other morphemes in unexpected 
ways. A possible diagnosis of the underlying de- 
fect is that we need to distinguish between produc- 
ers and consumers of information. In analogy to 
LFG's constraint vs constraining equations, infor- 
mation may only be consumed if it has been pro- 
duced at least once. 

For automata, let us spend a P/C bit per arc, with 
P/C=l for producers and P/C=0 for consumer arcs. 
In open interpretation mode, then, intersection com- 
bines the P/C bits of compatible arcs via logical OR, 
making producers dominant. It follows that a re- 
source may be multiply consumed, which has obvi- 
ous advantages for our application, the multiple re- 
alization of string symbols. A final step of closed in- 



terpretation prunes all consumer-only arcs that sur- 
vived constraint interaction, in what may be seen 
as intersection with the universal producer language 
under logical- AND combination of P/C bits. 

Using these resource-conscious notions, we can 
now model both the default absence of material and 
purely contextual requirements as consumer-type 
information: unless satisfied by lexical resources 
that have been explicitly produced, the correspond- 
ing arcs will not be part of the result. By convention, 
producers are displayed in bold. Thus, the exact re- 
sult of figure [Tj.a n [j].b after closed interpretation is: 

w -.i u -.o l-.o u -.o ° repeat 4 " repeat* w : \ u : q l : o u : x 

This expression also illustrates that, for parsing, 
strings like wuluowulu need to be consumer-self- 
loop-enriched via a small preprocessing step, be- 
cause intersection with the grammar would other- 
wise fail due to unmentioned technical arcs such as 
repeat. Because our proposal is fully declarative, 
parsing then reduces to intersecting the enriched 
parse string with the grammar-and-lexicon automa- 
ton (whose construction will itself involve intersec- 
tion) in closed interpretation mode, followed by a 
check for nonemptiness of the result. Whereas the 
original parse string was underspecified for mor- 
phological categories, the parse result for a realis- 
tic morphology system will, in addition to technical 
arcs, contain fully specified category arcs in some 
predefined linearization order, which can be effi- 
ciently retrieved if desired. 

3.4 On-demand Algorithms 

It is clear that the above method is particularly at- 
tractive if some of its operations can be performed 
online, since a fullform lexicon of productive redu- 
plications is clearly undesirable e.g. for Bambara. I 
therefore consider briefly questions of efficient im- 
plementation of these operations. 



Mohri et al. (1998) identify the existence of a 
local computation rule as the main precondition^ 
for a lazy implementation of automaton operations, 
i.e. one where results are only computed when 
demanded by subsequent operations. Such imple- 
mentations are very advantageous when large in- 
termediate automata may be constructed but only a 
small part of them is visited for any particular in- 
put. They show that such a rule exists for composi- 



3 A second condition is that no state is visited that has not 
been discovered from the start state. It is easy to implement ^ 
so that this condition is fulfilled as well. 



tion o, hence also for our operation of intersection 
(A n B = range{identity{A) o identity (B))). 

Fortunately, the three enrichment steps all have 
local computation rules as well: 



(6) a. qi -> q 2 

b. qi q-i 

c. q =4> a - 



repeat 

<?2 — ► qi 

skip 

qi — > <?2 



The impact of the existence of lazy implementa- 
tions for enrichment operations is twofold: we can 
(a) now maintain minimized base lexicons for stor- 
age efficiency and add enrichments lazily to the cur- 
rently pursued string hypothesis only, possibly mod- 
ulated by exception diacritics that control when en- 
richment should or should not happen^] And (b), 
laziness suffices to make the proposed reduplication 
method reasonably time-efficient, despite the larger 
number of online operations. Actual benchmarks 
from a pilot implementation are reported elsewhere 



(Walther, submitted) 



4 A Worked Example 

In this section I show how to implement the Koasati 



case from (3) using the FSA Utilities toolbox (van 
Noord, 1997). FSA Utilities is a Prolog-based 
finite-state toolkit and extendible regular expression 
compiler. It is freely available and encourages rapid 
prototyping. 

Figure || displays the regular expression opera- 
tors that will be used (italicized operators are mod- 
ifications or extensions). The grammar will be pre- 



[] 

[El, E2, . . . , En] 
{E1,E2, ... ,En} 
E* 
E" 
El & E2 
X-> ( Y/Z) 

~ s 

Head(argl, . . . , argN) 
:= Body 



empty string 
concatenation of Ej 
union of Ej 
Kleene closure 
optionality 
intersection 
monotonic rule 
X <Z X / _Z 
complement set of S 
(parametrized) 
macro definition 



Figure 2: Regular expression operators 

sented below in a piecewise fashion, with line num- 
bers added for easy reference. 



4 See |Walther (submittec ) for further details. With determin- 
istic automata, the question of how to recover from a wrong 
string hypothesis during parsing is not an issue. 



Starting with the definition of stems (line 1), we 
add the three enrichments to the bare phonological 
string (2). However, the innermost producer- type 
string constructed by stringToAutomaton (3) is 
intersected with phonological constraints (5,6) that 
need to see the string only, minus its enrichments. 
This is akin to lexical rule application. 

1 stem (FirstSeg, String) := 

2 add_repeats (add_skips (add_self_loops ( 

3 [FirstSeg, stringToAutomaton ( String) ] 

4 & ignore_technical_symbols_in ( 

5 moraif ication&mark_f irst_heavy_sy liable 

6 & positional_classif ication) ) ) ) . 
7 

8 underspecif ied_f or_voicing (BaseSpec) := 

9 { producer (BaseSpec & vowel), 

10 [producer (h) , consumer (skip) ] }. 
11 

12 tahaspin := stem ( [ ] , "tahaspin"). 

13 aklatlin := stem (underspecif ied_for_ 

14 voicing (low) , "klatlin" ) ■ 



Lines 8-10 capture the V/h alternation that is char- 
acteristic for vowel-initial stems under reduplica- 
tion, with the vocalic alternant constituting the de- 
fault used in isolated pronunciation. In contrast, 
the /h/ alternant is concatenated with a consumer- 
type skip that requires a producer from elsewhere. 
Lines 12-14 define two example stems. 

The following constraint (15-18) enriches a 
prosodically underspecified string with moras 
fi - abstract units of syllable weight (H ayes, 
1995) -, a prerequisite to locating (2 0-2 4) and 
synchronization-marking (2 5-31) the first heavy 
syllable after which the reduplicative infix will be 
inserted. 



15 moraif ication := 

16 ( vowel — > ( mora / sigma ) )& 

17 ( consonant — > ( mora / consonant ) ) & 

18 ( consonant — > ( (" mora) / vowel ) ) . 
19 

20 first_(X) := [not_contains (X) , X]. 

21 heavy_rime := [consumer (mora) , 

22 consumer (mora) ] . 

23 heavy_syllable := [consumer (~ mora), 
2 4 heavy_rime] . 

25 mark_f irst_heavy_syllable := 

26 [ f irst_ (heavy_rime ) & synced_constituent , 

27 synced_constituent ] . 

28 right_synced := [consumer)"' rl'&seg) *, 

29 consumer (': 1 ' &seg) ] . 

30 synced_constituent := 

31 [ consumer (': 1 ' &seg) , right_synced] . 

32 positional_classif ication := 

33 [ consumer ( initial ), consumer (medial ) *, 

34 consumer ( final ) ] . 



Note that both the constituent before (t-iohas-i) 

and after (p.i i n : i) the infixation site need to be 
marked. Also, it turns out to be useful to classify 
base string positions for easy reference in the redu- 
plicative morpheme, which motivates lines 32-34. 

The main part now is the reduplicative morpheme 
itself (35), which looks like a mixture of Bambara 
and Semai: the spellout of the base is followed by it- 
erated repeats (3 6) to move back to its synchronized 
initial position (37), which - recall /h/ - is required 
to be consonantal. The rest of the base is skipped 
before insertion of the fixed-melody part /o(o)/ oc- 
curs (38, 42-44). Proceeding with the interrupted 
realization of the base, we identify its beginning as 
a synchronized syllable onset (~ mora), followed 
by aright-synchronized string (3 9-4 0). 



35 punctual_aspect_reduplication := 

36 [ synced_constituent , producer ( repeat )* , 

37 consumer (': 1 ' & initial & consonant), 

38 producer (skip) *, f ixed_melody , 

39 consumer (': 1 ' & seg & ~ mora), 

40 right_synced] . 
41 

4 2 fixed_melody := 

43 [producer (o & ~ ' :1' & medial & mora), 

44 producer (o & ~ ' : 1' & medial & mora) * ] . 



Finally, some obvious word.level.con- 
straints need to be defined (45-54), before the 
central intersection of stem and punctual-aspect 
reduplication (5 7) completes our Koasati fragment: 



45 word_level_constraints := 

4 6 last_segment_is_moraic & 

47 last_two_sylls_are_heavy . 
48 

4 9 last_segment_is_moraic := 

50 [consumer (sigma) *, consumer (mora) ] . 

51 

52 last_two_sylls_are_heavy := 

53 [ consumer ( sigma) *, 

54 heavy_syllable, heavy_syllable] . 
55 

56 wordf orm (Stem) : =closed_interpretation ( 

57 word_level_constraints & Stem & 

58 punctual_aspect_reduplication) . 



The result of wordform ({tahaspin, aklatlin}) 
is shown in figure 1 ( [ and ] are aliases for initial 
and final position). 

Space precludes the description of a final automa- 
ton operation called Bounded Local Optimization 
( Walther, 1999| ) that turns out to be useful here to 




Figure 3: Koasati reduplications tahas-too-pin, ak-ho(o)-latlin 



ban unattested free length variation, as found e.g. 
in ak-ho(o)-latlin where the length of o is yet to 
be determined. Suffice to say that a parametriza- 
tion of Bounded Local Optimization would prune 
the moraic arc 16 — > 19 in figure |3] by considering it 
costlier than the non-moraic arc 16 — * 18, thereby 
eliminating the last source of indeterminacy. 

5 Conclusion 

This paper has presented a novel finite-state method 
for reduplication that is applicable for both un- 
bounded total cases, truncated or otherwise phono- 
logically modified types and infixing instances. The 
key ingredients of the proposal are suitably en- 
riched automaton representations, the identification 
of reduplicative copying with automaton intersec- 
tion and a resource-conscious interpretation that 
differentiates between two types of arc symbols, 
namely producers and consumers of information. 
After demonstrating the existence of efficient on- 
demand algorithms to reduplication's central oper- 
ations, a case study from Koasati has shown that 
all of the above ingredients may be necessary in the 
analysis of a single complex piece of prosodic mor- 
phology. 

It is worth mentioning that our method can be 
transferred into a two-level transducer setting with- 
out major difficulties ( Walther, 1999[ appendix B). 

I conclude that the one-level approach to redu- 
plicative prosodic morphology presents an attractive 
way of extending finite-state techniques to difficult 
phenomena that hitherto resisted elegant computa- 
tional analyses. 
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